Support collect and display the heartbeat thread status #17473

Open
wants to merge 7 commits into
base: master-2.x

Contributor

@maobaolong maobaolong commented May 23, 2023

What changes are proposed in this pull request?

Add support for collecting and displaying the status of heartbeat threads.

Why are the changes needed?

With this feature, we can inspect the status of the background heartbeat threads and detect problems before they cause an incident.

Does this PR introduce any user facing changes?

http://localhost:19999/api/v1/master/webui_heartbeat_threads
http://localhost:30000/api/v1/worker/webui_heartbeat_threads
http://localhost:20002/api/v1/job_master/webui_heartbeat_threads
http://localhost:30003/api/v1/job_worker/webui_heartbeat_threads

{
  "debug": false,
  "heartbeatThreadInfos": [
    {
      "count": 1,
      "threadName": "Master Block Integrity Check",
      "previousReport": "#0 [05-23-2023 09:59:13:190 - 05-23-2023 09:59:13:190 - 05-23-2023 09:59:13:209] ticked(s) 0, run(s) 0.",
      "startTickTime": "05-23-2023 09:59:13:209",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 3,
      "threadName": "Master Cluster Metrics Updater",
      "previousReport": "#2 [05-23-2023 10:00:13:164 - 05-23-2023 10:01:14:884 - 05-23-2023 10:01:19:769] ticked(s) 61, run(s) 4.",
      "startTickTime": "05-23-2023 10:01:19:771",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 1,
      "threadName": "Master Log Config Report Scheduling",
      "previousReport": "#0 [05-23-2023 09:59:13:255 - 05-23-2023 09:59:13:255 - 05-23-2023 09:59:13:255] ticked(s) 0, run(s) 0.",
      "startTickTime": "05-23-2023 09:59:13:256",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 1,
      "threadName": "Master Lost Files Detection",
      "previousReport": "#0 [05-23-2023 09:59:13:195 - 05-23-2023 09:59:13:195 - 05-23-2023 09:59:13:195] ticked(s) 0, run(s) 0.",
      "startTickTime": "05-23-2023 09:59:13:195",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 2,
      "threadName": "Master Lost Master Detection",
      "previousReport": "#1 [05-23-2023 09:59:13:251 - 05-23-2023 10:01:14:881 - 05-23-2023 10:01:19:715] ticked(s) 121, run(s) 4.",
      "startTickTime": "05-23-2023 10:01:19:723",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 3,
      "threadName": "Master Lost Proxy Detection",
      "previousReport": "#2 [05-23-2023 10:00:13:260 - 05-23-2023 10:01:14:882 - 05-23-2023 10:01:19:723] ticked(s) 61, run(s) 4.",
      "startTickTime": "05-23-2023 10:01:19:723",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 13,
      "threadName": "Master Lost Worker Detection",
      "previousReport": "#12 [05-23-2023 10:01:14:881 - 05-23-2023 10:01:19:713 - 05-23-2023 10:01:19:716] ticked(s) 4, run(s) 0.",
      "startTickTime": "05-23-2023 10:01:19:729",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 1,
      "threadName": "Master Metrics Time Series",
      "previousReport": "#0 [05-23-2023 09:59:13:214 - 05-23-2023 09:59:13:214 - 05-23-2023 09:59:13:232] ticked(s) 0, run(s) 0.",
      "startTickTime": "05-23-2023 09:59:13:234",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 110,
      "threadName": "Master Persistence Checker",
      "previousReport": "#109 [05-23-2023 10:01:25:462 - 05-23-2023 10:01:27:833 - 05-23-2023 10:01:27:836] ticked(s) 2, run(s) 0.",
      "startTickTime": "05-23-2023 10:01:27:846",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 110,
      "threadName": "Master Persistence Scheduler",
      "previousReport": "#109 [05-23-2023 10:01:27:830 - 05-23-2023 10:01:27:842 - 05-23-2023 10:01:27:846] ticked(s) 0, run(s) 0.",
      "startTickTime": "05-23-2023 10:01:27:846",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 3,
      "threadName": "Master Replication Check",
      "previousReport": "#2 [05-23-2023 10:00:13:212 - 05-23-2023 10:01:14:885 - 05-23-2023 10:01:14:885] ticked(s) 61, run(s) 0.",
      "startTickTime": "05-23-2023 10:01:15:462",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 1,
      "threadName": "Master TTL Check",
      "previousReport": "#0 [05-23-2023 09:59:13:192 - 05-23-2023 09:59:13:192 - 05-23-2023 09:59:13:193] ticked(s) 0, run(s) 0.",
      "startTickTime": "05-23-2023 09:59:13:194",
      "startHeartbeatTime": "",
      "status": "WAITING"
    },
    {
      "count": 0,
      "threadName": "Master Throttle",
      "previousReport": null,
      "startTickTime": "05-23-2023 09:59:13:269",
      "startHeartbeatTime": "05-23-2023 09:59:13:269",
      "status": "STOPPED"
    },
    {
      "count": 1,
      "threadName": "Worker register stream session cleaner",
      "previousReport": "#0 [05-23-2023 09:59:13:130 - 05-23-2023 09:59:13:130 - 05-23-2023 09:59:13:130] ticked(s) 0, run(s) 0.",
      "startTickTime": "05-23-2023 09:59:13:134",
      "startHeartbeatTime": "",
      "status": "WAITING"
    }
  ]
}

Based on this, we can build a web UI page for the heartbeat thread info.

@ChunxuTang
Member

Thanks for the work! Would you mind fixing the tests?

@dbw9580
Contributor

dbw9580 commented Jun 9, 2023

pending review

Contributor

@jiacheliu3 jiacheliu3 left a comment

Added a few comments, PTAL thanks!

mExecutor.heartbeat(limitTime);
long endHeartbeatTime = System.currentTimeMillis();
Contributor

same as above

Contributor Author

done

Comment on lines +189 to +190
mStartTickTime = 0L;
mStartHeartbeatTime = 0L;
Contributor

why reset to 0 every time?

Contributor Author

0 means the heartbeat has not started yet in this period.

Comment on lines 41 to 46
List<HeartbeatThread> list = HEARTBEAT_THREAD_INDEX_MAP.get(key);
if (list == null) {
list = new LinkedList<>();
HEARTBEAT_THREAD_INDEX_MAP.put(key, list);
}
HEARTBEAT_THREAD_MAP.put(heartbeatThread.getThreadName(), heartbeatThread);
Contributor

map.computeIfAbsent(key, (k) -> new ArrayList<>());
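For readers following the suggestion above, here is a minimal sketch of the `computeIfAbsent` pattern. The class `HeartbeatIndexSketch`, its methods, and the use of `String` thread names in place of `HeartbeatThread` are assumptions made for illustration and are not the code in this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the index map discussed above; names are illustrative. */
final class HeartbeatIndexSketch {
  // Plain HashMap: every access goes through a synchronized method.
  private final Map<Object, List<String>> mIndex = new HashMap<>();

  /** Adds a thread name under the key, creating the list on first use. */
  synchronized void register(Object key, String threadName) {
    // computeIfAbsent returns the existing list, or stores and returns a new one,
    // which replaces the get / null-check / put sequence.
    mIndex.computeIfAbsent(key, k -> new ArrayList<>()).add(threadName);
  }

  /** Returns a copy of the names registered under the key (empty if none). */
  synchronized List<String> threadsFor(Object key) {
    List<String> names = mIndex.get(key);
    return names == null ? new ArrayList<>() : new ArrayList<>(names);
  }

  public static void main(String[] args) {
    HeartbeatIndexSketch index = new HeartbeatIndexSketch();
    index.register("master", "Master TTL Check");
    index.register("master", "Master Throttle");
    System.out.println(index.threadsFor("master"));
  }
}
```

Chaining `.add(...)` onto the returned list keeps the lookup and the insertion in one statement.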

Contributor Author

Done

Comment on lines 29 to 32
private static final Map<String, HeartbeatThread> HEARTBEAT_THREAD_MAP
= new ConcurrentHashMap<>();
private static final Map<Object, List<HeartbeatThread>> HEARTBEAT_THREAD_INDEX_MAP
= new ConcurrentHashMap<>();
Contributor

you use synchronized methods so no need for ConcurrentHashMap here?

Contributor Author

done

}

@Override
public boolean equals(Object o) {
Contributor

where do you need equals() and hashCode()?

Contributor Author

Sorry, I lazily copied the style from existing code. I just referenced the files under alluxio/core/common/src/main/java/alluxio/wire, such as AlluxioWorkerInfo, BlockInfo, and LogInfo.

Contributor

i think you can remove equals and hashCode

mExecutor.heartbeat(limitTime);
long endHeartbeatTime = System.currentTimeMillis();
mPreviousReport = String.format("#%d [%s - %s - %s] ticked(s) %d, run(s) %d.",
Contributor

don't create the report in run(); update the state there, and generate the report on demand
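One way to read this suggestion, as a rough sketch only: the heartbeat loop stores raw timestamps, and the report string is built when someone asks for it. The class `HeartbeatStateSketch` and its method names are hypothetical. The report layout follows the format string quoted in the diff above, but prints raw millisecond timestamps instead of formatted dates to keep the sketch short. The meaning of `ticked` and `run` is inferred from the sample output in the PR description.

```java
/** Hypothetical holder for one heartbeat thread's timing state; not the PR's class. */
final class HeartbeatStateSketch {
  private long mCount;
  private long mStartTickMs;
  private long mStartHeartbeatMs;
  private long mEndHeartbeatMs;

  /** Called from the heartbeat loop: records raw values only, no string formatting. */
  synchronized void record(long startTickMs, long startHeartbeatMs, long endHeartbeatMs) {
    mStartTickMs = startTickMs;
    mStartHeartbeatMs = startHeartbeatMs;
    mEndHeartbeatMs = endHeartbeatMs;
    mCount++;
  }

  /** Called by a reader such as a REST handler: builds the report lazily. */
  synchronized String report() {
    if (mCount == 0) {
      return null; // nothing completed yet
    }
    // Assumption: "ticked" is the wait before the heartbeat, "run" is its duration.
    long tickedSeconds = (mStartHeartbeatMs - mStartTickMs) / 1000;
    long runSeconds = (mEndHeartbeatMs - mStartHeartbeatMs) / 1000;
    return String.format("#%d [%d - %d - %d] ticked(s) %d, run(s) %d.",
        mCount - 1, mStartTickMs, mStartHeartbeatMs, mEndHeartbeatMs,
        tickedSeconds, runSeconds);
  }

  public static void main(String[] args) {
    HeartbeatStateSketch state = new HeartbeatStateSketch();
    state.record(1000L, 62720L, 67605L);
    System.out.println(state.report());
  }
}
```

With this split, the hot path pays for four field writes per heartbeat, and the formatting cost is only incurred when the endpoint is queried.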

@maobaolong
Contributor Author

@jiacheliu3 Thanks for your review, I've addressed your comments, PTAL~

Contributor

@jiacheliu3 jiacheliu3 left a comment

Thanks for the work! I added some comments for discussion.

mExecutor.heartbeat(limitTime);
mEndHeartbeatTime = CommonUtils.getCurrentMs();
updateState();
Contributor

every tick you are resetting to zero? looks like a bug?

Contributor Author

yeah, it looks like we need a structure to store the previous and current state
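A possible shape for such a structure, offered as a sketch under assumptions rather than the eventual fix: keep two immutable snapshots, so finishing a period moves the current values into a "previous" slot instead of overwriting them with zeros. The names `HeartbeatPeriodsSketch` and `Period` are hypothetical.

```java
/** Hypothetical two-slot state holder; not taken from the PR. */
final class HeartbeatPeriodsSketch {
  /** Immutable timestamps for one period; 0 means "not reached yet". */
  static final class Period {
    final long mStartTickMs;
    final long mStartHeartbeatMs;
    final long mEndHeartbeatMs;

    Period(long startTickMs, long startHeartbeatMs, long endHeartbeatMs) {
      mStartTickMs = startTickMs;
      mStartHeartbeatMs = startHeartbeatMs;
      mEndHeartbeatMs = endHeartbeatMs;
    }
  }

  // volatile so readers see a consistent snapshot without taking the lock.
  private volatile Period mPrevious; // null until the first period completes
  private volatile Period mCurrent = new Period(0L, 0L, 0L);

  synchronized void startTick(long nowMs) {
    mCurrent = new Period(nowMs, 0L, 0L);
  }

  synchronized void startHeartbeat(long nowMs) {
    mCurrent = new Period(mCurrent.mStartTickMs, nowMs, 0L);
  }

  /** Completes the period: the finished values survive in the previous slot. */
  synchronized void endHeartbeat(long nowMs) {
    mPrevious = new Period(mCurrent.mStartTickMs, mCurrent.mStartHeartbeatMs, nowMs);
    mCurrent = new Period(0L, 0L, 0L);
  }

  Period previous() {
    return mPrevious;
  }

  Period current() {
    return mCurrent;
  }
}
```

Because each `Period` is immutable, a reader never observes a half-updated set of timestamps, and resetting the current slot no longer discards the last completed period.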

}

@Override
public boolean equals(Object o) {
Contributor

i think you can remove equals and hashCode

public final class WebUIHeartbeatThreads implements Serializable {
private static final long serialVersionUID = -2903043308252679410L;

private boolean mDebug;
Contributor

is this useful?

Contributor Author

Remove it

return RestUtils.call(() -> {
WebUIHeartbeatThreads response = new WebUIHeartbeatThreads();

response.setDebug(Configuration.getBoolean(PropertyKey.DEBUG));
Contributor

will the output be different if you set PropertyKey.DEBUG? I don't think this is relevant?

*/
public static Future<?> submit(ExecutorService executorService,
HeartbeatThread heartbeatThread) {
HeartbeatThreadManager.register(executorService, heartbeatThread);
Contributor

i don't think it's a good idea to use a thread pool as key...

Contributor Author

So, what can be the better key? Thread name?

* @return the heartbeat threads info
*/
public static synchronized Map<String, HeartbeatThreadInfo> getHeartbeatThreads() {
SortedMap<String, HeartbeatThreadInfo> heartbeatThreads = new TreeMap<>();
Contributor

double check if the thread is still alive, and make sure this ref is removed if the thread is already dead
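As an illustration of the liveness check requested above, the sketch below prunes finished entries before building the result. It assumes the registry keeps the `Future` returned when the heartbeat is submitted to an executor and treats `Future.isDone()` as "no longer alive". The class `HeartbeatRegistrySketch` and its methods are hypothetical and do not reflect how the PR tracks threads.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

/** Hypothetical registry keyed by thread name; not the PR's HeartbeatThreadManager. */
final class HeartbeatRegistrySketch {
  private final Map<String, Future<?>> mTasks = new HashMap<>();

  synchronized void register(String threadName, Future<?> task) {
    mTasks.put(threadName, task);
  }

  /** Drops entries whose task has finished, then returns the remaining names sorted. */
  synchronized SortedSet<String> liveThreadNames() {
    // isDone() is true after normal completion, failure, or cancellation.
    mTasks.entrySet().removeIf(entry -> entry.getValue().isDone());
    return new TreeSet<>(mTasks.keySet());
  }

  public static void main(String[] args) {
    HeartbeatRegistrySketch registry = new HeartbeatRegistrySketch();
    registry.register("Master TTL Check", new CompletableFuture<Void>());
    registry.register("Master Throttle", CompletableFuture.completedFuture(null));
    System.out.println(registry.liveThreadNames());
  }
}
```

Pruning inside the getter keeps the map from holding references to heartbeats that have already stopped; an alternative would be to unregister explicitly when the heartbeat loop exits.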

4 participants